Abstract

This paper describes the development of DEPEND, an integrated simulation environment for the design
and dependability analysis of fault-tolerant systems. DEPEND models both hardware and software
components at a functional level, and allows automatic failure injection to assess system performance
and reliability. It relieves the user of the work needed to inject failures, maintain statistics and output
reports. The automatic failure injection scheme is geared toward evaluating a system under high stress
(workload) conditions. The injected failures can affect both hardware and software components.
To illustrate the capability of the simulator, a distributed system which employs a prediction-based,
dynamic load-balancing heuristic is evaluated. Experiments are conducted to determine the
impact of failures on system performance and to identify the failures to which the system is especially
susceptible.

Keywords: Fault-tolerance, Design, Evaluation, Simulation, Fault-Injection, Distributed Systems.
1.  Introduction

The application of computers in commercial, military, health and industrial environments has increased
rapidly, and with it the need for these computers to be reliable and to offer high performance.
Tools are now needed to assist in the design and dependability analysis of reliable computer
systems. Currently there are a few tools which allow some automated design and evaluation. Analytical
tools like SHARPE [Sahner 87], SAVE [Goyal 86], and METASAN [Sanders 86] have been in use
for some time. Recent research has been directed toward the creation of simulators and test environments.
For example, FIAT [Segall 88] is a testing environment which is designed to inject errors into a
software application in order to validate error detection and recovery mechanisms. OODRA [Hwang
89] is a visually-oriented workbench that is used for evaluating the performance and reconfiguration
capabilities of highly concurrent application-specific architectures. FOCUS [Choi 89] is a hierarchical
mixed-mode simulator that is used to evaluate the fault-tolerance and reliability of VLSI systems with
specific emphasis on transient errors. In [Kubiak 89], the authors describe an event-driven simulator
called GRACE and use it to study the dependability of a bit-serial processing element.

This paper describes the development of DEPEND, an integrated simulation environment for the
design and dependability analysis of fault-tolerant systems. DEPEND models both hardware and
software components at a functional level, and allows automatic failure injection to assess system
performance and reliability. Complex hardware/software interactions can also be studied. The environment
is designed to expedite and simplify the process of simulating a fault-tolerant architecture. It relieves the
user of the work needed to inject failures, maintain statistics and output reports. The automatic failure
injection scheme is geared toward evaluating a system under high stress (workload) conditions. The
injected failures can affect both the hardware and the software components. To illustrate the
capability of the simulator, a distributed system which employs a prediction-based, dynamic
load-balancing heuristic is evaluated. Experiments are conducted to determine the impact of failures on
system performance and to identify the failures to which the system is especially susceptible.

2.  The  DEPEND  Simulation  Tool 

In DEPEND a library of objects is used to simulate hardware components (e.g., CPUs,
communication channels and disks). The fault-tolerant characteristics of an object are specified by the user.
These include the type of fault-tolerance mechanism (e.g., stand-by redundancy or graceful
degradation), the type of failures injected (permanent or transient) and the method by which failures are
injected. Each object contains routines which automatically inject failures, maintain a record of all
failures injected, keep error statistics (e.g., mean time between failures) and output reports. The
software components are modeled by C++ routines written by the user.

The simulation environment is shown in Figure 1. It is based on CSIM [Schwetman 86], which is
a process-based simulation language written in C. The user sees an object-oriented interface because the
DEPEND library is written in C++. The simulator contains a viewing system called PARAGRAPH,
which graphically displays the key performance indicators during a simulation [Lee 89]. Either actual
programs or trace files from actual workloads can be used in the simulations. The next subsection
describes the main objects defined in the DEPEND library.

Figure 1.  The DEPEND simulation environment.

2.1  The Objects

The most basic object in the DEPEND library is called Basic_svr. This object is used to simulate
servers like CPUs and disks and it is also used to build more complex objects. Basic_svr consists of
methods which can be invoked by a user to simulate the functions of a server, inject failures and repair
servers. For example, the Fault() method is used to inject a failure into a server. Both transient and
permanent failures can be injected. When a server is injected with a failure it becomes inoperative (i.e.,
all processes using the server are deleted and no others are accepted until the server is repaired). In
addition, event flags associated with the server are set to notify the user of a change in the server’s
status. These event flags can be monitored by calling methods like wait_for_fault() and
wait_for_repair(), and can then be used to trigger remedial action such as reconfiguration. The
No_fault() method is used to repair a server or its spare (for stand-by redundancy). By controlling the
time between a call to Fault() and No_fault(), the duration of a failure can be controlled. The reserve(),
use() and release() methods are used to simulate the acquisition, use and release of a server. In addition
to these methods, there are many others which allow the user to check the server’s status and acquire
performance measurements like the server’s utilization, queue length and throughput. More complex
objects like the Distributed_system objects and the Communication_channel objects are built using
Basic_svr objects.
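
The fault and repair life cycle described above can be sketched as a small C++ class. This is a minimal illustration and not the DEPEND library: the class name ServerSketch, the lower-case method names, the explicit time arguments and the bookkeeping fields are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a Basic_svr-like server; not the DEPEND API.
class ServerSketch {
public:
    // Acquire the server for a process; refused while the server is failed.
    bool reserve(int process_id) {
        if (failed_) return false;
        users_.push_back(process_id);
        return true;
    }

    // Release the server on behalf of a process (no-op if it is not a user).
    void release(int process_id) {
        for (auto it = users_.begin(); it != users_.end(); ++it) {
            if (*it == process_id) { users_.erase(it); return; }
        }
    }

    // Inject a failure at simulated time `now`: all users are ejected and
    // the server stays inoperative until no_fault() is called.
    void fault(double now) {
        assert(now >= 0.0);
        if (failed_) return;
        failed_ = true;
        users_.clear();
        fault_start_ = now;
        ++faults_injected_;
    }

    // Repair the server at time `now`; the failure duration is the time
    // elapsed between fault() and no_fault().
    void no_fault(double now) {
        if (!failed_) return;
        assert(now >= fault_start_);
        total_downtime_ += now - fault_start_;
        failed_ = false;
    }

    bool failed() const { return failed_; }
    std::size_t active_users() const { return users_.size(); }
    int faults_injected() const { return faults_injected_; }
    double total_downtime() const { return total_downtime_; }

private:
    bool failed_ = false;
    std::vector<int> users_;
    double fault_start_ = 0.0;
    double total_downtime_ = 0.0;
    int faults_injected_ = 0;
};
```

In this sketch a transient of a given duration corresponds to calling fault(t) followed by no_fault(t + duration); a permanent failure is simply a fault() that is never followed by a repair.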

A Distributed_system object simulates a distributed set of processors. This object does not specify
the connectivity of these processors. The connectivity is specified by Network objects which are used in
conjunction with the Distributed_system object. A Distributed_system object consists of many instances of
Basic_svr and a set of failure injection routines. The instances of Basic_svr are used to simulate the
processors. The failure injection routines automatically inject transient and permanent failures into the
processors based on the specifics of the injection strategies described in Section 2.2. These routines are also
responsible for maintaining a record of all injected failures and for keeping performance measurements.
They also provide a full report of all injected failures (e.g., where and when failures were injected, the
mean time between failures and the mean failure/recovery duration).
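
A report of this kind can be derived from a simple log of injected failures. The sketch below is an assumption-laden illustration rather than DEPEND's reporting code: the FailureRecord and FailureReport types are hypothetical, and the mean time between failures is taken here to be the average gap between consecutive injection times.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One injected failure: where, when, and for how long (hypothetical layout).
struct FailureRecord {
    int processor;
    double start;     // injection time
    double duration;  // failure/recovery duration
};

struct FailureReport {
    std::size_t count;
    double mean_time_between_failures;  // 0 when fewer than two failures
    double mean_duration;
};

// Summarize a log that is ordered by injection time.
FailureReport summarize(const std::vector<FailureRecord>& log) {
    assert(!log.empty());
    FailureReport report{log.size(), 0.0, 0.0};
    for (std::size_t i = 0; i < log.size(); ++i) {
        assert(log[i].duration >= 0.0);
        if (i > 0) assert(log[i].start >= log[i - 1].start);
        report.mean_duration += log[i].duration;
    }
    report.mean_duration /= static_cast<double>(log.size());
    if (log.size() > 1) {
        // Average gap between consecutive injections.
        report.mean_time_between_failures =
            (log.back().start - log.front().start) /
            static_cast<double>(log.size() - 1);
    }
    return report;
}
```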

A Network object is used to define the connectivity of the processors in a Distributed_system. A
Communication_channel object is a type of Network object which simulates a single bus communication
channel. It consists of a Basic_svr object (to simulate the communication channel), several Port objects
(to simulate the I/O ports) and failure injection routines. Currently, three types of channel failures can
be simulated. The first simply makes the Communication_channel inoperative (i.e., no messages can be
sent via the channel). The second causes the communication channel to occasionally lose messages.
The third failure type causes the channel to intermittently corrupt messages. The latter two failure types
simulate a ‘noisy’ communication channel.

2.2  The  Fault  Injection  Schemes 

Three fault injection strategies are currently available in DEPEND. All three are incorporated
into the Distributed_system and the Communication_channel objects described above. In the first scheme,
faults are injected at a constant rate. In the second scheme, faults are injected based on an exponential
distribution; the duration of transients is based on exponential or normal distributions. The third
approach injects faults such that there is a high probability of injections under heavy workloads. On one
hand this ensures that the system is tested under stress conditions. On the other, it models the
workload/failure dependency observed in [Iyer 82] and [Castillo 82]. The duration of transients is based
on exponential and normal distributions.
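
The first two schemes can be illustrated with a short sketch that generates injection times over a simulation horizon. The function names are hypothetical, "constant rate" is interpreted here as one injection every 1/rate time units, and clamping normally distributed durations at zero is an assumption of this example; none of this is taken from the DEPEND source.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Scheme 1 (as interpreted here): one injection every 1/rate time units.
std::vector<double> constant_rate_schedule(double rate, double horizon) {
    assert(rate > 0.0 && horizon >= 0.0);
    std::vector<double> times;
    const double interval = 1.0 / rate;
    for (int k = 1; k * interval <= horizon; ++k) times.push_back(k * interval);
    return times;
}

// Scheme 2: inter-arrival times drawn from an exponential distribution.
std::vector<double> exponential_schedule(double rate, double horizon,
                                         std::mt19937& rng) {
    assert(rate > 0.0 && horizon >= 0.0);
    std::exponential_distribution<double> gap(rate);
    std::vector<double> times;
    for (double t = gap(rng); t <= horizon; t += gap(rng)) times.push_back(t);
    return times;
}

// Duration of a transient drawn from a normal distribution, clamped at zero
// so that a negative sample never yields a negative duration.
double transient_duration(double mean, double stddev, std::mt19937& rng) {
    assert(stddev > 0.0);
    std::normal_distribution<double> d(mean, stddev);
    return std::max(0.0, d(rng));
}
```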

In order to implement a workload-dependent injection strategy, a statistical clustering algorithm is
first used to identify high-density regions of the workload. These regions (defined as states) are used to
build a state transition diagram to characterize the workload [Hsueh 88]. Associated with each state is a
visit counter which counts the number of visits to that state. Also associated is a fault rate, $\lambda$, which the
system experiences in that state. Periodically, the workload is monitored to identify the workload state
and to update the appropriate visit counter. Based on the injection interval, the information from the
state transition diagram is used to estimate a weighted average failure arrival rate (Wgt_lambda) as
follows:

$$ \mathit{Wgt\_lambda} = \sum_{i=1}^{N} \mathit{visit\_ratio}_i \times \lambda_i \tag{1} $$

where:

N = the number of states

$$ \mathit{visit\_ratio}_i = \frac{\text{visit counter for state } i}{\text{total visits to all the states}} $$
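
Equation (1) translates directly into code. The sketch below is illustrative only; the WorkloadState type and the function name are hypothetical.

```cpp
#include <cassert>
#include <vector>

// One workload state: its visit counter and its associated fault rate.
struct WorkloadState {
    unsigned long visits;
    double lambda;
};

// Weighted average failure arrival rate, equation (1):
// sum over all states of visit_ratio_i * lambda_i.
double weighted_lambda(const std::vector<WorkloadState>& states) {
    unsigned long total_visits = 0;
    for (const auto& s : states) total_visits += s.visits;
    assert(total_visits > 0);  // at least one visit must have been recorded

    double wgt_lambda = 0.0;
    for (const auto& s : states) {
        const double visit_ratio =
            static_cast<double>(s.visits) / static_cast<double>(total_visits);
        wgt_lambda += visit_ratio * s.lambda;
    }
    return wgt_lambda;
}
```

For example, with hypothetical visit counts of 10, 5 and 5 and fault rates of 0.1, 0.3 and 1.0, the visit ratios are 0.5, 0.25 and 0.25, giving 0.05 + 0.075 + 0.25 = 0.375.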

Once Wgt_lambda is determined, it is used to compute the probability of a failure injection (P_inject(t))
over the last interval t (= injection interval) as follows:

P_inject(t) = [expression not recoverable from the source text]
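
One common way to turn an arrival rate into a probability over an interval is to assume that failures arrive as a Poisson process, which gives a probability of 1 - exp(-rate x t) for at least one arrival in an interval of length t. The sketch below uses that form purely as an illustrative assumption; it is not necessarily the expression used in DEPEND, and the function names and the choice of units (rate in failures per hour, interval in seconds) are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <random>

// Probability of at least one failure in the interval, ASSUMING Poisson
// arrivals at the weighted rate (an assumption of this sketch).
double injection_probability(double wgt_lambda_per_hour, double interval_seconds) {
    assert(wgt_lambda_per_hour >= 0.0 && interval_seconds >= 0.0);
    const double hours = interval_seconds / 3600.0;
    return 1.0 - std::exp(-wgt_lambda_per_hour * hours);
}

// Decide whether to inject by comparing a random draw against the probability.
bool should_inject(double probability, std::mt19937& rng) {
    assert(probability >= 0.0 && probability <= 1.0);
    std::bernoulli_distribution coin(probability);
    return coin(rng);
}
```

Under this assumption, a rate of 1.0 failure per hour checked every 20 seconds yields an injection probability of roughly 0.0055 per check.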


3.  The  Simulated  Distributed  System 

This  section  briefly  describes  the  distributed  system  used  to  demonstrate  some  of  the  features  of 
DEPEND.  A  detailed  description  can  be  found  in  [Goswami  89]. 

Figure 2 is a framework for the distributed system. The simulated system contains a homogeneous
set of processors connected by a single communication channel. The system is assumed to have a
reconfiguration mechanism that repairs faulty processors and restarts the processes which were executing
on the processor within a short period of time (e.g., in less than 2 minutes). Processor 0 contains
the central scheduler which consists of a predictor and a scheduler. The predictor uses a statistical
pattern-recognition method to predict the CPU, I/O and memory requirements of a program prior to its
execution [Devarakonda 89]. The scheduler executes a load-balancing heuristic called MINQ which
uses predicted process resource requirements to determine the processor load and send incoming
processes to the processor with the least load. The formula used by MINQ to estimate the processor
load is as follows:

Figure 2.  MINQ: Centralized load-sharing with prediction.


$$ \mathit{CPU\_LOAD}_i = \sum_{j=1}^{N_i} \frac{\mathit{CPUREQ}_j}{\mathit{CPUREQ}_j + \mathit{IOREQ}_j} \tag{2} $$

where:

$N_i$ = the number of processes in processor $i$
$\mathit{IOREQ}_j$ is the predicted I/O requirement of process $j$ in units of time
$\mathit{CPUREQ}_j$ is the predicted CPU requirement of process $j$ in units of time
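
Equation (2) and the least-load selection can be sketched as follows. The type and function names are hypothetical, and breaking ties in favor of the lowest processor index is an assumption of this example rather than something stated for MINQ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Predicted resource requirements of one process, in units of time.
struct PredictedProcess {
    double cpu_req;
    double io_req;
};

// Equation (2): sum over the processes on a processor of
// CPUREQ_j / (CPUREQ_j + IOREQ_j).
double cpu_load(const std::vector<PredictedProcess>& processes) {
    double load = 0.0;
    for (const auto& p : processes) {
        assert(p.cpu_req >= 0.0 && p.io_req >= 0.0);
        assert(p.cpu_req + p.io_req > 0.0);
        load += p.cpu_req / (p.cpu_req + p.io_req);
    }
    return load;
}

// Index of the processor with the least estimated load
// (ties go to the lowest index, an assumption of this sketch).
std::size_t least_loaded(
        const std::vector<std::vector<PredictedProcess>>& processors) {
    assert(!processors.empty());
    std::size_t best = 0;
    double best_load = cpu_load(processors[0]);
    for (std::size_t i = 1; i < processors.size(); ++i) {
        const double load = cpu_load(processors[i]);
        if (load < best_load) { best = i; best_load = load; }
    }
    return best;
}
```

Each process contributes the fraction of its predicted lifetime spent on the CPU, so a processor running many I/O-bound processes can still appear lightly loaded.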


When a process completes its execution the processor sends the actual resources used by the
process to the central scheduler via a status update message. This message is used to update the databases
maintained by the predictor and decrement the CPU_LOAD value of the processor which sent the
message.

In this study, we define the workload on a processor to be its CPU utilization. To characterize the
workload, each processor has its own state transition diagram in which a state represents a specific
utilization level of the processor. Each processor’s utilization is measured every second and its state
transition diagram is updated accordingly. Every 20 seconds the state transition diagram of a processor is
used to compute the processor’s weighted average fault rate using equation (1). This value is then used to
determine the probability of a fault injection (P_inject(t)). Since this procedure is followed for each
processor independently, multiple processors can fail at a given time.
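
The per-second sampling and the 20-second injection check can be sketched as a small tracker. The class name is hypothetical, the three utilization ranges are those listed in Table 1 of this paper, and assigning a boundary value such as 0.4 to the higher state is an assumption of this example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Per-processor workload tracker (hypothetical; not the DEPEND code).
class UtilizationTracker {
public:
    // Map a utilization in [0, 1] to a state index 0..2, corresponding to
    // states 1..3 with ranges 0.0-0.4, 0.4-0.8 and 0.8-1.0.
    static std::size_t state_of(double utilization) {
        assert(utilization >= 0.0 && utilization <= 1.0);
        if (utilization < 0.4) return 0;
        if (utilization < 0.8) return 1;
        return 2;
    }

    // Called once per simulated second with the measured utilization.
    // Returns true on every 20th sample, when an injection check is due.
    bool sample(double utilization) {
        ++visits_[state_of(utilization)];
        ++samples_;
        return samples_ % kInjectionInterval == 0;
    }

    unsigned long visits(std::size_t state) const {
        assert(state < visits_.size());
        return visits_[state];
    }

private:
    static constexpr unsigned long kInjectionInterval = 20;  // seconds
    std::array<unsigned long, 3> visits_{};                   // visit counters
    unsigned long samples_ = 0;
};
```

Keeping one tracker per processor mirrors the independence noted above: each processor reaches its own injection decision, so several can fail at once.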


4.  The  Fault  Models 

In DEPEND, components are simulated at a functional level; the impact of physical
faults is therefore modeled by a change in the functional behavior. The fault models used in this study
simulate failures in the processor and the communication channel. Both transient and permanent failures
can be injected. The duration of transients and intermittents is selected based on a normal distribution.

The  processor  fault  model  used  to  inject  failures  into  a  processor  is  defined  as  follows: 

1.  All processes executing on the processor are ejected.

2.  Ejected processes hang until the processor is revived and are then restarted from the beginning.

3.  All messages sent to the processor while it is failed are collected but not processed until the
processor is revived.

4.  If the processor contains the central scheduler, the databases maintained by the predictor and the
scheduler are erased.

Two fault models are used for communication channel faults. In the first, a message loss fault
model, the communication channel is assumed to incur intermittent failures that cause a specified
percentage of all messages processed by the communication channel to be lost. A message that is lost is
simply destroyed and not delivered to its destination. In the second, a message garble fault model, the
communication channel is assumed to incur intermittent failures that corrupt pre-specified bytes in a
message. Only messages that are processed by the communication channel when the channel is faulty
(or ‘noisy’) are garbled.
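
The two channel fault models can be sketched together in one small class. This is an illustration under stated assumptions, not the Communication_channel object: the class name, the constructor parameters and the choice to garble a byte by inverting its bits are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

using Message = std::vector<unsigned char>;

// Hypothetical noisy channel combining the loss and garble fault models.
class NoisyChannelSketch {
public:
    NoisyChannelSketch(double loss_fraction,
                       std::vector<std::size_t> garble_offsets,
                       unsigned seed)
        : lose_(loss_fraction),
          garble_offsets_(std::move(garble_offsets)),
          rng_(seed) {
        assert(loss_fraction >= 0.0 && loss_fraction <= 1.0);
    }

    // Switch the garble fault on or off (the channel being 'noisy').
    void set_faulty(bool faulty) { faulty_ = faulty; }

    // Pass a message through the channel. Returns false if the message is
    // lost; otherwise the message is delivered, garbled in place when the
    // channel is currently faulty.
    bool transmit(Message& message) {
        if (lose_(rng_)) return false;  // message loss fault model
        if (faulty_) {                  // message garble fault model
            for (std::size_t offset : garble_offsets_) {
                if (offset < message.size()) message[offset] ^= 0xFF;
            }
        }
        return true;
    }

private:
    std::bernoulli_distribution lose_;
    std::vector<std::size_t> garble_offsets_;
    std::mt19937 rng_;
    bool faulty_ = false;
};
```

Targeting fixed byte offsets makes it possible to aim the corruption at one specific field of a message while leaving the rest intact.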

These fault models were selected because they can be used to inject faults in areas that are crucial
to the functionality of the distributed system discussed above. A centralized load-balancing heuristic is
especially vulnerable to failures in the processor which houses the scheduler and to failures that affect
the status update messages received by the scheduler.

5.  The  Experiments 

Our experiments showed that, for the type of system studied, a single failure, even in the
processor containing the central scheduler, has an insignificant impact on the response time so long as
reconfiguration is achieved within a small period of time. The problems that impact system
performance quite significantly are due to intermittents [Iyer 90] occurring in close succession (e.g., 5 to 10
failures per hour). In this paper, we consider the impact of intermittent failures.

Table 1.  The transition diagram used by all processors.

State    Low Util.    High Util.    Lambda (failure/hr)
  1        0.0          0.4           0.1
  2        0.4          0.8           0.3
  3        0.8          1.0           1.0

The experiments were conducted on a ten-processor system. An actual trace file containing
processes run on a VAX-11/780 was used as input to the simulation. The trace file was also used to
derive the state transition diagram, shown in Table 1, to characterize the workload for failure injection
purposes. The failure rates shown in the table were selected to create frequent intermittents. For each
experiment, the simulation was executed five to six times with different random seeds and an average of
these results is shown in the graphs below. The main performance metric used in the study is the
response time for all of the processes.

5.1  Processor  Failures 

The processor fault model was used to inject failures into the processors in the system. Figure 3
shows the response times of MINQ for the ten-processor system when transients of 0, 10, 30, 60, 90 and
120 second duration were injected. Each simulation lasted about 1.5 hours and approximately 7 failures
were injected during this period. Of these failures, typically 16% were injected into the processor
containing the central scheduler.

Results in Figure 3 show that persistent intermittents degrade system performance considerably.
For two-minute transients, there was a 46% increase in the response time. However, as stated earlier,
in simulations where only one or two failures were injected, there was little or no performance
degradation.

5.2  Corrupted  Status  Update  Messages 


Figure  3.  Impact  of  transient  failures. 
An important issue in the design of these systems is the impact of corrupted messages. To
evaluate this effect, the message garble fault model was used in conjunction with the constant fault injection
scheme to corrupt status update messages. Specifically, the fault injections were designed to corrupt the
CPUREQx field in the status update message. The CPUREQx field is used by MINQ to decrement the
CPU_LOAD value. Corruption of the CPUREQx field has the most adverse impact on the database
maintained by the scheduler and hence allows a worst case evaluation of the system. Figure 4 shows the
results from experiments where 0, 5, 10 and 20% of the messages were corrupted. There is a 15%
degradation in the response time when 5% of the messages are corrupted. The degradation more than
doubles to 35% when 10% of the messages are corrupted.

5.3  Losing  Status  Update  Messages 

The  message  loss  fault  model  was  used  in  this  experiment  to  destroy  the  status  update  messages 
sent  to  the  central  scheduler.  Figure  5  is  a  graph  of  MINQ’s  response  times  when  0,  5,  10,  20  and  30% 
of  the  status  update  messages  were  destroyed.  Relative  to  figures  3  and  4,  MINQ  seems  to  be  extremely 
sensitive  to  lost  status  update  messages.  With  only  10%  of  the  messages  destroyed  there  is  nearly  a 
300%  increase  in  the  response  time. 

Upon close examination it became apparent that the poor performance was not due to the
predictor or the scheduler but due to an implementation detail. MINQ uses status update messages, sent by
the processors, to decrement the processor load. When messages are lost, due to a faulty
communication channel, the load values are not decremented. Processor 0, however, houses the scheduler and does
not use the communication channel to send status update messages. Hence, processor 0’s load is always
decremented and appears lower than that of the other processors. This causes MINQ to assign a
disproportionate number of processes to processor 0, resulting in the extremely poor performance shown in
Figure 5. In fact, the simulations showed that processor 0 had 4 to 20 times more processes assigned to
it than the other processors. Thus, faults that impair the communication channel for a reasonable period
of time and prevent status update messages from reaching the scheduler can cause severe problems
unless the implementation is changed.

Figure 4.  Impact of corrupted status update messages.

Figure 5.  Impact of lost status update messages.

To reduce this problem, processor 0 was forced to use the communication channel when sending
status update messages. The experiment was re-run with this set up. The results for the ten-processor
system are shown in Figure 6. The sensitivity seen in Figure 5 has disappeared because all the
processors now lose their status update messages. MINQ (with the new set up) shows only a 16% increase in the
response time when 10% of the messages are lost, as opposed to the 300% increase seen in Figure 5.
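
The bias toward processor 0, and the effect of routing its status updates through the channel as well, can be reproduced with a toy model of the scheduler's load table. This sketch is not the MINQ implementation: every process is given unit load and completes immediately, every loss_every-th channel message is dropped as a deterministic stand-in for random loss, and ties are broken round-robin. All names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the number of processes assigned to each processor.
// loss_every == 0 disables message loss.
std::vector<int> simulate_assignments(std::size_t processors, int steps,
                                      int loss_every, bool p0_uses_channel) {
    assert(processors > 0 && steps >= 0 && loss_every >= 0);
    std::vector<int> recorded_load(processors, 0);  // scheduler's view
    std::vector<int> assigned(processors, 0);
    std::size_t last = processors - 1;
    int channel_messages = 0;

    for (int s = 0; s < steps; ++s) {
        // Pick the least-loaded processor; among equals, take the first
        // one after the previously chosen processor (round-robin).
        std::size_t pick = (last + 1) % processors;
        for (std::size_t k = 1; k < processors; ++k) {
            const std::size_t c = (last + 1 + k) % processors;
            if (recorded_load[c] < recorded_load[pick]) pick = c;
        }
        ++assigned[pick];
        ++recorded_load[pick];
        last = pick;

        // On completion a status update decrements the recorded load,
        // unless the update travels over the channel and is lost.
        const bool via_channel = (pick != 0) || p0_uses_channel;
        bool lost = false;
        if (via_channel && loss_every > 0) {
            ++channel_messages;
            lost = (channel_messages % loss_every == 0);
        }
        if (!lost) --recorded_load[pick];
    }
    return assigned;
}
```

With four processors, 1000 processes and every tenth channel message lost, processor 0 ends up with well over 900 of the assignments when it bypasses the channel, because each of the other processors soon carries a phantom load that is never decremented. When processor 0 also uses the channel, phantom load accumulates on all processors alike and no single processor dominates.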

An additional result can be deduced from Figures 4 and 6. After approximately 10% of the status
update messages are lost or corrupted, the increase in the response time levels out. At this stage, the
database used by MINQ is so corrupted that MINQ seems to schedule processes randomly. Thus,
increasing the number of destroyed or corrupted messages does not further degrade system performance.
In experiments where up to 50% of the status update messages were destroyed, the response time
was still approximately 4.5 seconds.

Figure 6.  Impact of uniform message loss.

6.  Conclusion 

This paper presented DEPEND, a simulation-based tool for design and reliability analysis of
computer systems. DEPEND consists of a library of basic objects that simulate components like CPUs,
communication channels and disks. The failure characteristics of an object, such as the type of
fault-tolerance mechanism used (e.g., stand-by redundancy or graceful degradation) and the type of failures
injected (transient or permanent), can be specified by the user. These objects serve as the building
blocks with which a complex system can be simulated for dependability evaluation.

DEPEND also features a workload-based fault injection scheme which ensures an increased
probability of fault injection under heavy workload conditions. One of the advantages of DEPEND is that
it readily allows the user to simulate complex architectures as well as the interaction between
the hardware and the software.

To illustrate some of the features of DEPEND, a distributed system employing MINQ, a
prediction-based centralized load-balancing heuristic, was modeled and analyzed to determine its
susceptibility to intermittent failures. The results show that frequent intermittents cause significant
performance degradation. MINQ was moderately sensitive to corrupted status update messages. However,
simulations using DEPEND helped identify an implementation problem which made MINQ extremely
susceptible to failures that destroy status update messages.